{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#用阈值创建二元特征"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在前一个主题，我们介绍了数据转换成标准正态分布的方法。现在，我们看看另一种完全不同的转换方法。\n",
    "\n",
    "当不需要呈标准化分布的数据时，我们可以不处理它们直接使用；但是，如果有足够理由，直接使用也许是聪明的做法。通常，尤其是处理连续数据时，可以通过建立二元特征来分割数据。\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Getting ready"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通常建立二元特征是非常有用的方法，不过要格外小心。我们还是用`boston`数据集来学习如何创建二元特征。\n",
    "\n",
    "首先，加载`boston`数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "boston = datasets.load_boston()\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How to do it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与标准化处理类似，scikit-learn有两种方法二元特征：\n",
    "\n",
    "- `preprocessing.binarize`（一个函数）\n",
    "- `preprocessing.Binarizer`（一个类）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`boston`数据集的因变量是房子的价格中位数（单位：千美元）。这个数据集适合测试回归和其他连续型预测算法，但是假如现在我们想预测一座房子的价格是否高于总体均值。要解决这个问题，我们需要创建一个均值的阈值。如果一个值比均值大，则为`1`；否则，则为`0`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.,  0.,  1.,  1.,  1.])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import preprocessing\n",
    "new_target = preprocessing.binarize(boston.target, threshold=boston.target.mean())\n",
    "new_target[0,:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "很容易，让我们检查一下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 0, 1, 1, 1])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(boston.target[:5] > boston.target.mean()).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "既然Numpy已经很简单了，为什么还要用scikit-learn的函数呢？管道命令，将在*用管道命令联接多个预处理步骤*一节中介绍，会解释这个问题；要用管道命令就要用`Binarizer`类："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.,  0.,  1.,  1.,  1.])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bin = preprocessing.Binarizer(boston.target.mean())\n",
    "new_target = bin.fit_transform(boston.target)\n",
    "new_target[0,:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How it works..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "方法看着非常简单；其实scikit-learn在底层创建一个检测层，如果被监测的值比阈值大就返回`Ture`。然后把满足条件的值更新为`1`，不满足条件的更新为`0`。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##There's more..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们再介绍一些稀疏矩阵和`fit`方法的知识。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###稀疏矩阵"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "稀疏矩阵的`0`是不被存储的；这样可以节省很多空间。这就为`binarizer`造成了问题，需要指定阈值参数`threshold`不小于`0`来解决，如果`threshold`小于`0`就会出现错误："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Cannot binarize a sparse matrix with threshold < 0",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-31-c9b5156c63ab>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[1;32mfrom\u001b[0m \u001b[0mscipy\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msparse\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mcoo\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[0mspar\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcoo\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcoo_matrix\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mbinomial\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m.25\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m100\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mpreprocessing\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mbinarize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mspar\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mthreshold\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m-\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32md:\\programfiles\\Miniconda3\\lib\\site-packages\\sklearn\\preprocessing\\data.py\u001b[0m in \u001b[0;36mbinarize\u001b[1;34m(X, threshold, copy)\u001b[0m\n\u001b[0;32m    718\u001b[0m     \u001b[1;32mif\u001b[0m \u001b[0msparse\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0missparse\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    719\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[0mthreshold\u001b[0m \u001b[1;33m<\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 720\u001b[1;33m             raise ValueError('Cannot binarize a sparse matrix with threshold '\n\u001b[0m\u001b[0;32m    721\u001b[0m                              '< 0')\n\u001b[0;32m    722\u001b[0m         \u001b[0mcond\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mX\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdata\u001b[0m \u001b[1;33m>\u001b[0m \u001b[0mthreshold\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mValueError\u001b[0m: Cannot binarize a sparse matrix with threshold < 0"
     ]
    }
   ],
   "source": [
    "from scipy.sparse import coo\n",
    "spar = coo.coo_matrix(np.random.binomial(1, .25, 100))\n",
    "preprocessing.binarize(spar, threshold=-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###`fit`方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`binarizer`类里面有`fit`方法，但是它只是通用接口，并没有实际的拟合操作，仅返回对象。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
